Objectives:

1. Multiple linear regression

Here, we’ll be modeling housing price based on house characteristics using data from SLO home sales.

  • Read in slo_homes.csv file and explore
  • Only keep observations for San Luis Obispo, Atascadero and Arroyo Grande
# Packages used in this document: tidyverse, janitor, corrplot, beepr, praise,
# stargazer, sf, here, gganimate
homes <- read_csv("slo_homes.csv") %>% 
  clean_names() %>% # convert column names to lower_snake_case
  filter(city %in% c("San Luis Obispo", "Atascadero", "Arroyo Grande"))
## Parsed with column specification:
## cols(
##   City = col_character(),
##   Price = col_double(),
##   Bedrooms = col_double(),
##   Bathrooms = col_double(),
##   SqFt = col_double(),
##   PricePerSqFt = col_double(),
##   Status = col_character()
## )
beep(10) # beepr::beep() accepts sound options 1-12

praise() # returns a random compliment
## [1] "You are super-duper!"
praise("You are totally ${adjective}! Super ${EXCLAMATION}!")
## [1] "You are totally ultimate! Super HMM!"

Are there correlations between variables that we’d consider while trying to model home price?

Some exploring: look at correlations between numeric variables

homes_cor <- cor(homes[2:5])
homes_cor
##               price  bedrooms bathrooms     sq_ft
## price     1.0000000 0.3115258 0.4991453 0.6618575
## bedrooms  0.3115258 1.0000000 0.7519186 0.6960233
## bathrooms 0.4991453 0.7519186 1.0000000 0.8046112
## sq_ft     0.6618575 0.6960233 0.8046112 1.0000000
corrplot(homes_cor, 
         method = "ellipse", # could also do "circle"
         type = "upper")

# Moderate correlations between numeric variables: no concern about
# multicollinearity here from a correlation-value standpoint, but there may be
# some conceptual overlap between variables

praise()
## [1] "You are impeccable!"
# See: names(praise_parts) for other things you can call in praise

# For example, customize it a bit
praise("You are totally ${adjective}! Super ${EXCLAMATION}!")
## [1] "You are totally cat's meow! Super AHHH!"

Are there reasons to believe linear relationships exist here, i.e., that a multiple linear regression model actually makes sense?

ggplot(data = homes, aes(x = sq_ft, y = price)) +
  geom_point()

ggplot(data = homes, aes(x = bedrooms, y = price)) +
  geom_point()

But let’s start out with a complete model using all the variables we have (city, bedrooms, bathrooms, sq_ft, and status):

# price model as a function (~) of city, bedrooms, bathrooms, sq_ft and status
homes_lm <- lm(price ~ city + bedrooms + bathrooms + sq_ft + status, data = homes)

# View it: 
homes_lm
## 
## Call:
## lm(formula = price ~ city + bedrooms + bathrooms + sq_ft + status, 
##     data = homes)
## 
## Coefficients:
##         (Intercept)       cityAtascadero  citySan Luis Obispo  
##            184130.9            -167396.4              31018.1  
##            bedrooms            bathrooms                sq_ft  
##           -161645.5              48692.0                389.1  
##       statusRegular     statusShort Sale  
##            303964.6             -19828.6
summary(homes_lm)
## 
## Call:
## lm(formula = price ~ city + bedrooms + bathrooms + sq_ft + status, 
##     data = homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -843938 -115963  -12298  109718 3445077 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          184130.93  143368.84   1.284  0.20157    
## cityAtascadero      -167396.39   78701.95  -2.127  0.03552 *  
## citySan Luis Obispo   31018.14  114895.10   0.270  0.78766    
## bedrooms            -161645.51   49414.32  -3.271  0.00141 ** 
## bathrooms             48692.04   52408.63   0.929  0.35476    
## sq_ft                   389.12      53.17   7.318 3.43e-11 ***
## statusRegular        303964.64  105942.81   2.869  0.00488 ** 
## statusShort Sale     -19828.56   86335.35  -0.230  0.81875    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 380600 on 117 degrees of freedom
## Multiple R-squared:  0.5682, Adjusted R-squared:  0.5424 
## F-statistic:    22 on 7 and 117 DF,  p-value: < 2.2e-16
# The coefficients give us the equation of the model:

Price = 184130.93 - 167396.39(cityAtascadero) + 31018.14(citySan Luis Obispo) - 161645.51(bedrooms) + 48692.04(bathrooms) + 389.12(sq_ft) + 303964.64(statusRegular) - 19828.56(statusShort Sale)
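The same equation can be pulled out of the fitted model programmatically (a sketch, assuming homes_lm exists as fitted above; the 2000 sq ft example home is hypothetical):

```r
# Extract the fitted coefficients by name
coefs <- coef(homes_lm)
coefs["sq_ft"] # marginal effect of one additional square foot

# Manually compute a predicted price for a hypothetical 2000 sq ft,
# 3 bed / 2 bath, Regular-status home in San Luis Obispo:
coefs["(Intercept)"] +
  coefs["citySan Luis Obispo"] +
  3 * coefs["bedrooms"] +
  2 * coefs["bathrooms"] +
  2000 * coefs["sq_ft"] +
  coefs["statusRegular"]
```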

Well, that’s kind of a nightmare to look at, and putting it into a table could be really challenging. Bedrooms, bathrooms, and square footage are all indicators of how big the house is. Let’s try another version of the model, using only sq_ft as a measure of home size:
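Since bedrooms, bathrooms, and sq_ft all measure home size, one way to quantify that overlap before dropping variables is with variance inflation factors (a sketch, assuming the {car} package is installed; it is not used elsewhere in this document):

```r
library(car)

# vif() reports how much each coefficient's variance is inflated by
# correlation with other predictors (GVIF for models with factors);
# values above ~4 are often taken as a multicollinearity flag
vif(homes_lm)
```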

homes_lm2 <- lm(price ~ city + sq_ft + status, data = homes)

summary(homes_lm2)
## 
## Call:
## lm(formula = price ~ city + sq_ft + status, data = homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -653118 -120914   -8844  101402 3644738 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -105820.71  118697.40  -0.892   0.3745    
## cityAtascadero      -182586.90   80683.44  -2.263   0.0254 *  
## citySan Luis Obispo   70278.97  115846.31   0.607   0.5452    
## sq_ft                   325.03      31.54  10.306   <2e-16 ***
## statusRegular        315441.26  109749.65   2.874   0.0048 ** 
## statusShort Sale      31284.44   87356.11   0.358   0.7209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 395100 on 119 degrees of freedom
## Multiple R-squared:  0.5268, Adjusted R-squared:  0.5069 
## F-statistic: 26.49 on 5 and 119 DF,  p-value: < 2.2e-16

AIC

AIC(homes_lm)
## [1] 3576.834
AIC(homes_lm2)
## [1] 3584.3
# We generally prefer the model with the lower AIC value...
# ...but whether the model makes sense conceptually should be the most
# important thing: added complexity has to earn its keep by being useful

We believe model 2 is the most conceptually sound model.

Enter, the stargazer package.

stargazer(homes_lm, type = "html")
Dependent variable:
price
cityAtascadero -167,396.400**
(78,701.950)
citySan Luis Obispo 31,018.140
(114,895.100)
bedrooms -161,645.500***
(49,414.320)
bathrooms 48,692.040
(52,408.630)
sq_ft 389.120***
(53.171)
statusRegular 303,964.600***
(105,942.800)
statusShort Sale -19,828.560
(86,335.350)
Constant 184,130.900
(143,368.800)
Observations 125
R2 0.568
Adjusted R2 0.542
Residual Std. Error 380,586.300 (df = 117)
F Statistic 21.997*** (df = 7; 117)
Note: p<0.1; p<0.05; p<0.01

Let’s answer a few questions:

  • How do we interpret each of these variables?
  • What is the reference level for city?
  • What does NOT make sense here, i.e., which variables might describe information already captured by another variable in the model?
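The reference-level question can be checked directly: lm() treats character predictors as factors, and the first level alphabetically becomes the reference, which is why Arroyo Grande and Foreclosure never appear as coefficients. A sketch:

```r
# The first level listed is the reference level for each categorical predictor
levels(factor(homes$city))   # "Arroyo Grande" is first alphabetically
levels(factor(homes$status))

# To choose a different reference level, relevel before fitting, e.g.
# (fct_relevel() is from forcats, loaded with the tidyverse):
# homes <- homes %>% mutate(city = fct_relevel(city, "San Luis Obispo"))
```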

Try another version of the model:

homes_lm2 <- lm(price ~ city + sq_ft + status, data = homes)

And check out the results:

stargazer(homes_lm2, type = "html") # And you can customize...
Dependent variable:
price
cityAtascadero -182,586.900**
(80,683.440)
citySan Luis Obispo 70,278.970
(115,846.300)
sq_ft 325.028***
(31.538)
statusRegular 315,441.300***
(109,749.700)
statusShort Sale 31,284.440
(87,356.110)
Constant -105,820.700
(118,697.400)
Observations 125
R2 0.527
Adjusted R2 0.507
Residual Std. Error 395,084.400 (df = 119)
F Statistic 26.491*** (df = 5; 119)
Note: p<0.1; p<0.05; p<0.01

You can also use stargazer for multiple model comparisons:

stargazer(homes_lm, homes_lm2, type = "html")
Dependent variable:
price
(1) (2)
cityAtascadero -167,396.400** -182,586.900**
(78,701.950) (80,683.440)
citySan Luis Obispo 31,018.140 70,278.970
(114,895.100) (115,846.300)
bedrooms -161,645.500***
(49,414.320)
bathrooms 48,692.040
(52,408.630)
sq_ft 389.120*** 325.028***
(53.171) (31.538)
statusRegular 303,964.600*** 315,441.300***
(105,942.800) (109,749.700)
statusShort Sale -19,828.560 31,284.440
(86,335.350) (87,356.110)
Constant 184,130.900 -105,820.700
(143,368.800) (118,697.400)
Observations 125 125
R2 0.568 0.527
Adjusted R2 0.542 0.507
Residual Std. Error 380,586.300 (df = 117) 395,084.400 (df = 119)
F Statistic 21.997*** (df = 7; 117) 26.491*** (df = 5; 119)
Note: p<0.1; p<0.05; p<0.01
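The stargazer output can be customized quite a bit; here is a sketch using a few of its labeling arguments (the label strings are just illustrative choices):

```r
# Relabel the dependent variable and the seven coefficient rows;
# covariate.labels must match the order and count of the coefficients
stargazer(homes_lm, homes_lm2,
          type = "html",
          digits = 0,
          dep.var.labels = "Home price (USD)",
          covariate.labels = c("Atascadero", "San Luis Obispo",
                               "Bedrooms", "Bathrooms", "Square footage",
                               "Regular sale", "Short sale"))
```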

2. Exploring diagnostics

Now, check assumptions for normality and homoscedasticity. We can use the diagnostic plots to check assumptions about residuals, e.g.:

  • Constant variance of residuals?
  • Normally distributed residuals?
  • Also check: any notable outliers?
plot(homes_lm2) 

# Residuals vs Fitted: the spread looks roughly constant (homoscedastic)
# QQ plot: the residuals are close to normally distributed. Point 121 has a
# much higher price than the model predicts; that may mean we're missing some
# information, not necessarily that the model itself is wrong
# Residuals vs Leverage: check for unduly influential points
plot(homes_lm)
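By default, plot() on an lm object steps through the four diagnostic plots one at a time; to see them together in a single panel (a base-R sketch):

```r
par(mfrow = c(2, 2)) # 2 x 2 plotting grid
plot(homes_lm2)      # all four diagnostic plots at once
par(mfrow = c(1, 1)) # reset to a single plot per page
```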

Make a nice regression table:

stargazer(homes_lm2, type = "html")
Dependent variable:
price
cityAtascadero -182,586.900**
(80,683.440)
citySan Luis Obispo 70,278.970
(115,846.300)
sq_ft 325.028***
(31.538)
statusRegular 315,441.300***
(109,749.700)
statusShort Sale 31,284.440
(87,356.110)
Constant -105,820.700
(118,697.400)
Observations 125
R2 0.527
Adjusted R2 0.507
Residual Std. Error 395,084.400 (df = 119)
F Statistic 26.491*** (df = 5; 119)
Note: p<0.1; p<0.05; p<0.01

3. Predictions for home price with multiple variables

First, we’ll make a new data frame containing all variables that home_lm2 needs to make a prediction:

Make sure that the variables we create for the new data match the variables that the model will be looking for to make new predictions.

new_df <- data.frame(
  city = rep(c("San Luis Obispo", "Arroyo Grande", "Atascadero"), each = 10),
  sq_ft = rep(seq(1000, 5000, length.out = 10), times = 3),
  status = "Regular"
)

# rep() repeats values: rep(values, times = n) or rep(values, each = n)
# seq() generates a sequence: seq(from, to, length.out = n)
# status is length 1, so data.frame() recycles "Regular" across all 30 rows
new_df
##               city    sq_ft  status
## 1  San Luis Obispo 1000.000 Regular
## 2  San Luis Obispo 1444.444 Regular
## 3  San Luis Obispo 1888.889 Regular
## 4  San Luis Obispo 2333.333 Regular
## 5  San Luis Obispo 2777.778 Regular
## 6  San Luis Obispo 3222.222 Regular
## 7  San Luis Obispo 3666.667 Regular
## 8  San Luis Obispo 4111.111 Regular
## 9  San Luis Obispo 4555.556 Regular
## 10 San Luis Obispo 5000.000 Regular
## 11   Arroyo Grande 1000.000 Regular
## 12   Arroyo Grande 1444.444 Regular
## 13   Arroyo Grande 1888.889 Regular
## 14   Arroyo Grande 2333.333 Regular
## 15   Arroyo Grande 2777.778 Regular
## 16   Arroyo Grande 3222.222 Regular
## 17   Arroyo Grande 3666.667 Regular
## 18   Arroyo Grande 4111.111 Regular
## 19   Arroyo Grande 4555.556 Regular
## 20   Arroyo Grande 5000.000 Regular
## 21      Atascadero 1000.000 Regular
## 22      Atascadero 1444.444 Regular
## 23      Atascadero 1888.889 Regular
## 24      Atascadero 2333.333 Regular
## 25      Atascadero 2777.778 Regular
## 26      Atascadero 3222.222 Regular
## 27      Atascadero 3666.667 Regular
## 28      Atascadero 4111.111 Regular
## 29      Atascadero 4555.556 Regular
## 30      Atascadero 5000.000 Regular
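A quick reminder of how rep() and seq() behave:

```r
rep(c("a", "b"), each = 2)      # "a" "a" "b" "b"
rep(c("a", "b"), times = 2)     # "a" "b" "a" "b"
seq(1000, 5000, length.out = 3) # 1000 3000 5000
```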

Then, use the predict() function to find the predicted home prices for each combination in new_df:

# Make the predictions using new_df:
# predict(model, newdata = the values we want predictions for)
predict_df <- predict(homes_lm2, newdata = new_df)


# Then bind predictions together with new_df:
full_df <- data.frame(new_df, predict_df)
full_df
##               city    sq_ft  status predict_df
## 1  San Luis Obispo 1000.000 Regular   604927.5
## 2  San Luis Obispo 1444.444 Regular   749384.3
## 3  San Luis Obispo 1888.889 Regular   893841.2
## 4  San Luis Obispo 2333.333 Regular  1038298.1
## 5  San Luis Obispo 2777.778 Regular  1182754.9
## 6  San Luis Obispo 3222.222 Regular  1327211.8
## 7  San Luis Obispo 3666.667 Regular  1471668.7
## 8  San Luis Obispo 4111.111 Regular  1616125.5
## 9  San Luis Obispo 4555.556 Regular  1760582.4
## 10 San Luis Obispo 5000.000 Regular  1905039.3
## 11   Arroyo Grande 1000.000 Regular   534648.5
## 12   Arroyo Grande 1444.444 Regular   679105.4
## 13   Arroyo Grande 1888.889 Regular   823562.2
## 14   Arroyo Grande 2333.333 Regular   968019.1
## 15   Arroyo Grande 2777.778 Regular  1112476.0
## 16   Arroyo Grande 3222.222 Regular  1256932.8
## 17   Arroyo Grande 3666.667 Regular  1401389.7
## 18   Arroyo Grande 4111.111 Regular  1545846.6
## 19   Arroyo Grande 4555.556 Regular  1690303.4
## 20   Arroyo Grande 5000.000 Regular  1834760.3
## 21      Atascadero 1000.000 Regular   352061.6
## 22      Atascadero 1444.444 Regular   496518.5
## 23      Atascadero 1888.889 Regular   640975.3
## 24      Atascadero 2333.333 Regular   785432.2
## 25      Atascadero 2777.778 Regular   929889.1
## 26      Atascadero 3222.222 Regular  1074345.9
## 27      Atascadero 3666.667 Regular  1218802.8
## 28      Atascadero 4111.111 Regular  1363259.7
## 29      Atascadero 4555.556 Regular  1507716.5
## 30      Atascadero 5000.000 Regular  1652173.4
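predict() can also report the uncertainty around each prediction; a sketch using the same new data (interval = "prediction" gives bounds for an individual home, while interval = "confidence" gives bounds for the mean price):

```r
# 95% prediction intervals (fit, lwr, upr) for each row of new_df
predict(homes_lm2, newdata = new_df, interval = "prediction", level = 0.95)
```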

4. Visualize it!

# pch sets the point shape (here, one shape per city)
ggplot() +
  geom_point(data = homes, 
             aes(x = sq_ft, 
                 y = price,
                 color = city,
                 pch = city),
             size = 1,
             alpha = 0.5) + 
  geom_line(data = full_df,
            aes(x = sq_ft, 
                y = predict_df,
                color = city)) +
  scale_color_manual(values = c("orange", "magenta", "black")) +
  theme_light()

5. Compare AIC values

…but statistics are not a substitute for judgment!

AIC(homes_lm)
## [1] 3576.834
AIC(homes_lm2)
## [1] 3584.3
# Pretty close, but the first model has a lower AIC. Should I pick the first model just based on AIC? NO!
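Since homes_lm2 is nested in homes_lm (it simply drops bedrooms and bathrooms), the two models can also be compared with a partial F-test; a sketch:

```r
# anova() on nested lm fits tests whether the dropped terms
# (bedrooms and bathrooms) significantly improve the fit
anova(homes_lm2, homes_lm)
```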

6. A little map teaser

Our first map with the {sf} package.

sf (by Edzer Pebesma) is great because of its “sticky” geometries: the geometry column stays attached to the data through wrangling operations.

Data source: https://nid.sec.usace.army.mil/ords/f?p=105:19:14606625072661::NO:::

Get and check out the CA dams data:

# Only keep dams that have latitude, longitude, and year_completed information
dams <- read_csv("ca_dams.csv") %>% 
  clean_names() %>% 
  drop_na(latitude) %>% 
  drop_na(longitude) %>% 
  drop_na(year_completed)
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   RECORDID = col_double(),
##   STATEID = col_logical(),
##   LONGITUDE = col_double(),
##   LATITUDE = col_double(),
##   DISTANCE = col_double(),
##   YEAR_COMPLETED = col_double(),
##   DAM_LENGTH = col_double(),
##   DAM_HEIGHT = col_double(),
##   STRUCTURAL_HEIGHT = col_double(),
##   HYDRAULIC_HEIGHT = col_double(),
##   NID_HEIGHT = col_double(),
##   MAX_DISCHARGE = col_double(),
##   MAX_STORAGE = col_double(),
##   NORMAL_STORAGE = col_double(),
##   NID_STORAGE = col_double(),
##   SURFACE_AREA = col_double(),
##   DRAINAGE_AREA = col_double(),
##   INSPECTION_FREQUENCY = col_double(),
##   SPILLWAY_WIDTH = col_double(),
##   VOLUME = col_double()
##   # ... with 6 more columns
## )
## See spec(...) for full column specifications.
## Warning: 109 parsing failures.
##  row       col           expected actual          file
## 1337 FED_OTHER 1/0/T/F/TRUE/FALSE     CE 'ca_dams.csv'
## 1338 FED_OTHER 1/0/T/F/TRUE/FALSE     CE 'ca_dams.csv'
## 1339 FED_OTHER 1/0/T/F/TRUE/FALSE     CE 'ca_dams.csv'
## 1340 FED_OTHER 1/0/T/F/TRUE/FALSE     CE 'ca_dams.csv'
## 1341 FED_OTHER 1/0/T/F/TRUE/FALSE     CE 'ca_dams.csv'
## .... ......... .................. ...... .............
## See problems(...) for more details.

Then make sure R understands that latitude & longitude are spatial coordinates using sf::st_as_sf():

# Convert our data frame to an sf object using st_as_sf()
dams_sf <- st_as_sf(dams, 
                    coords = c("longitude", "latitude")
                    )

# Assign a coordinate reference system: EPSG 4326 (WGS84 lat/long)
st_crs(dams_sf) <- 4326



# Check class:
class(dams_sf)
## [1] "sf"         "tbl_df"     "tbl"        "data.frame"
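Equivalently, the CRS can be assigned in the same call, using the crs argument of st_as_sf(); a sketch:

```r
# Convert and set the CRS (WGS84 lat/long) in one step
dams_sf <- st_as_sf(dams,
                    coords = c("longitude", "latitude"),
                    crs = 4326)
```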

What does that mean? Now R recognizes this as spatial data.

plot(dams_sf)
## Warning: plotting the first 9 out of 67 attributes; use max.plot = 67 to plot
## all
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf

Cool! Now let’s get an outline for California:

NOTE: MAKE SURE BOTH LAYERS USE THE SAME PROJECTION!

ca_border <- read_sf(here::here("ca_state_border"), layer = "CA_State_TIGER2016")

plot(ca_border)
## Warning: plotting the first 9 out of 14 attributes; use max.plot = 14 to plot
## all
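To verify that both layers share a CRS before plotting them together, and to reproject one to match the other if they don’t, a sketch:

```r
# TRUE if the two layers use the same coordinate reference system
st_crs(dams_sf) == st_crs(ca_border)

# If not, transform one layer to match the other:
ca_border <- st_transform(ca_border, st_crs(dams_sf))
```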

Then plot them together with ggplot2!

ggplot() +
  geom_sf(data = dams_sf)

ggplot() +
  geom_sf(data = ca_border)

# Combine: 
ggplot() +
  geom_sf(data = ca_border,
          color = "white",
          fill = "grey40") +
  geom_sf(data = dams_sf,
          size = 1,
          alpha = 0.5,
          color = "orange") +
  theme_minimal()

beepr::beep(11)

7. A little gganimate teaser

Now show how dams have been added over time with gganimate!

ggplot() +
  geom_sf(data = ca_border,
          fill = "pink",
          color = "white") +
  geom_sf(data = dams_sf, 
          size = 1.5,
          color = "gray50") +
  theme_void() +
  labs(title = 'Year: {round(frame_time, 0)}') +
  transition_time(year_completed) + # animate over the year each dam was completed
  shadow_mark() # keep each dam visible after it first appears

# the animation renders when run interactively, but not in the knitted markdown
beep(4)
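To render and save the animation as a GIF (a sketch; assumes the gganimate plot above has been assigned to an object, here given the hypothetical name dam_plot, and that a renderer such as {gifski} is installed):

```r
# Render the animation, then write it to disk
anim <- animate(dam_plot, nframes = 100, fps = 10)
anim_save("ca_dams.gif", animation = anim)
```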